PySpark Cheat Sheet
Source: the pyspark-cheatsheet repository on GitHub.
This cheat sheet will help you learn PySpark and write PySpark apps faster. Everything here is fully functional PySpark code you can run or adapt to your own programs.
These snippets are licensed under the CC0 1.0 Universal License, which means you can freely copy and adapt them without giving attribution or including any notices.
These snippets use DataFrames loaded from various data sources:
- "Auto MPG Data Set" available from the UCI Machine Learning Repository.
- customer_spend.csv, a generated time series dataset.
- date_examples.csv, a generated dataset with various date and time formats.
- weblog.csv, a cleaned version of a public web log dataset.
These snippets were tested against the Spark 3.2.2 API. This page was last updated 2022-09-19 15:31:03.
Make note of these helpful links:
- PySpark DataFrame Operations
- Built-in Spark SQL Functions
- MLlib Main Guide
- Structured Streaming Guide
- PySpark SQL Functions Source
Generate the Cheatsheet
You can generate the cheatsheet by running cheatsheet.py in your PySpark environment as follows:
- Install dependencies:
pip3 install -r requirements.txt
- Generate README.md:
python3 cheatsheet.py
- Generate cheatsheet.ipynb:
python3 cheatsheet.py --notebook
Table of contents
- Accessing Data Sources
- Load a DataFrame from CSV
- Load a DataFrame from a Tab Separated Value (TSV) file
- Save a DataFrame in CSV format
- Load a DataFrame from Parquet
- Save a DataFrame in Parquet format
- Load a DataFrame from JSON Lines (jsonl) Formatted Data
- Save a DataFrame into a Hive catalog table
- Load a Hive catalog table into a DataFrame
- Load a DataFrame from a SQL query
- Load a CSV file from Amazon S3
- Load a CSV file from Oracle Cloud Infrastructure (OCI) Object Storage
- Read an Oracle DB table into a DataFrame using a Wallet
- Write a DataFrame to an Oracle DB table using a Wallet
- Write a DataFrame to a Postgres table
- Read a Postgres table into a DataFrame
- Data Handling Options
- Provide the schema when loading a DataFrame from CSV
- Save a DataFrame to CSV, overwriting existing data
- Save a DataFrame to CSV with a header
- Save a DataFrame in a single CSV file
- Save DataFrame as a dynamic partitioned table
- Overwrite specific partitions
- Load a CSV file with a money column into a DataFrame
- DataFrame Operations
- Add a new column to a DataFrame
- Modify a DataFrame column
- Add a column with multiple conditions
- Add a constant column
- Concatenate columns
- Drop a column
- Change a column name
- Change multiple column names
- Change all column names at once
- Convert a DataFrame column to a Python list
- Convert a scalar query to a Python value
- Consume a DataFrame row-wise as Python dictionaries
- Select particular columns from a DataFrame
- Create an empty DataFrame with a specified schema
- Create a constant DataFrame
- Convert String to Double
- Convert String to Integer
- Get the size of a DataFrame
- Get a DataFrame's number of partitions
- Get data types of a DataFrame's columns
- Convert an RDD to a DataFrame
- Print the contents of an RDD
- Print the contents of a DataFrame
- Process each row of a DataFrame
- DataFrame Map example
- DataFrame Flatmap example
- Create a custom UDF
- Transforming Data
- Sorting and Searching
- Filter a column using a condition
- Filter based on a specific column value
- Filter based on an IN list
- Filter based on a NOT IN list
- Filter values based on keys in another DataFrame
- Get DataFrame rows that match a substring
- Filter a DataFrame based on a custom substring search
- Filter based on a column's length
- Multiple filter conditions
- Sort DataFrame by a column
- Take the first N rows of a DataFrame
- Get distinct values of a column
- Remove duplicates
- Grouping
- count(*) on a particular column
- Group and sort
- Filter groups based on an aggregate value, equivalent to SQL HAVING clause
- Group by multiple columns
- Aggregate multiple columns
- Aggregate multiple columns with custom orderings
- Get the maximum of a column
- Sum a list of columns
- Sum a column
- Aggregate all numeric columns
- Count unique after grouping
- Count distinct values on all columns
- Group by then filter on the count
- Find the top N per row group (use N=1 for maximum)
- Group key/values into a list
- Compute a histogram
- Compute global percentiles
- Compute percentiles within a partition
- Compute percentiles after aggregating
- Filter rows with values below a target percentile
- Aggregate and rollup
- Aggregate and cube
- Joining DataFrames
- File Processing
- Handling Missing Data
- Dealing with Dates
- Unstructured Analytics
- Pandas
- Convert Spark DataFrame to Pandas DataFrame
- Convert Pandas DataFrame to Spark DataFrame with Schema Detection
- Convert Pandas DataFrame to Spark DataFrame using a Custom Schema
- Convert N rows from a DataFrame to a Pandas DataFrame
- Grouped Aggregation with Pandas
- Use a Pandas Grouped Map Function via applyInPandas
- Data Profiling
- Data Management
- Save to a Delta Table
- Update records in a DataFrame using Delta Tables
- Merge into a Delta table
- Show Table Version History
- Load a Delta Table by Version ID (Time Travel Query)
- Load a Delta Table by Timestamp (Time Travel Query)
- Compact a Delta Table
- Add custom metadata to a Delta table write
- Read custom Delta table metadata
- Spark Streaming
- Connect to Kafka using SASL PLAIN authentication
- Create a windowed Structured Stream over input CSV files
- Create an unwindowed Structured Stream over input CSV files
- Add the current timestamp to a DataFrame
- Session analytics on a DataFrame
- Call a UDF only when a threshold is reached
- Streaming Machine Learning
- Control stream processing frequency
- Write a streaming DataFrame to a database
- Time Series
- Machine Learning
- Prepare data for training with a VectorAssembler
- A basic Random Forest Regression model
- Hyperparameter tuning
- Encode string variables as numbers
- One-hot encode a categorical variable
- Optimize a model after a data preparation pipeline
- Evaluate Model Performance
- Get feature importances of a trained model
- Plot Hyperparameter tuning metrics
- Compute correlation matrix
- Save a model
- Load a model and use it for transformations
- Load a model and use it for predictions
- Load a classification model and use it to compute confidences for output labels
- Performance
- Get the Spark version
- Log messages using Spark's Log4J
- Cache a DataFrame
- Show the execution plan, with costs
- Partition by a Column Value
- Range Partition a DataFrame
- Change Number of DataFrame Partitions
- Coalesce DataFrame partitions
- Set the number of shuffle partitions
- Sample a subset of a DataFrame
- Run multiple concurrent jobs in different pools
- Print Spark configuration properties
- Set Spark configuration properties
- Publish Metrics to Graphite
- Increase Spark driver/executor heap space
Accessing Data Sources
Loading data stored in filesystems or databases, and saving it.
Load a DataFrame from CSV
See https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html for a list of supported options.
df = spark.read.format("csv").option("header", True).load("data/auto-mpg.csv")
# Code snippet result:
+----+---------+------------+----------+------+------------+---------+------+----------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|modelyear|origin| carname|
+----+---------+------------+----------+------+------------+---------+------+----------+
|18.0| 8| 307.0| 130.0| 3504.| 12.0| 70| 1|chevrol...|
|15.0| 8| 350.0| 165.0| 3693.| 11.5| 70| 1|buick s...|
|18.0| 8| 318.0| 150.0| 3436.| 11.0| 70| 1|plymout...|
|16.0| 8| 304.0| 150.0| 3433.| 12.0| 70| 1|amc reb...|
|17.0| 8| 302.0| 140.0| 3449.| 10.5| 70| 1|ford to...|
|15.0| 8| 429.0| 198.0| 4341.| 10.0| 70| 1|ford ga...|
|14.0| 8| 454.0| 220.0| 4354.| 9.0| 70| 1|chevrol...|
|14.0| 8| 440.0| 215.0| 4312.| 8.5| 70| 1|plymout...|
|14.0| 8| 455.0| 225.0| 4425.| 10.0| 70| 1|pontiac...|
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...|
+----+---------+------------+----------+------+------------+---------+------+----------+
only showing top 10 rows
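By default the CSV reader loads every column as a string. If you want Spark to infer column types instead, you can add the inferSchema option; a minimal sketch (note that inferSchema makes an extra pass over the data):
# Ask Spark to infer column types instead of loading everything as strings.
# inferSchema scans the input an extra time, so it costs extra read time.
df = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load("data/auto-mpg.csv")
)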
Load a DataFrame from a Tab Separated Value (TSV) file
See https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html for a list of supported options.
df = (
spark.read.format("csv")
.option("header", True)
.option("sep", "\t")
.load("data/auto-mpg.tsv")
)
# Code snippet result:
+----+---------+------------+----------+------+------------+---------+------+----------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|modelyear|origin| carname|
+----+---------+------------+----------+------+------------+---------+------+----------+
|18.0| 8| 307.0| 130.0| 3504.| 12.0| 70| 1|chevrol...|
|15.0| 8| 350.0| 165.0| 3693.| 11.5| 70| 1|buick s...|
|18.0| 8| 318.0| 150.0| 3436.| 11.0| 70| 1|plymout...|
|16.0| 8| 304.0| 150.0| 3433.| 12.0| 70| 1|amc reb...|
|17.0| 8| 302.0| 140.0| 3449.| 10.5| 70| 1|ford to...|
|15.0| 8| 429.0| 198.0| 4341.| 10.0| 70| 1|ford ga...|
|14.0| 8| 454.0| 220.0| 4354.| 9.0| 70| 1|chevrol...|
|14.0| 8| 440.0| 215.0| 4312.| 8.5| 70| 1|plymout...|
|14.0| 8| 455.0| 225.0| 4425.| 10.0| 70| 1|pontiac...|
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...|
+----+---------+------------+----------+------+------------+---------+------+----------+
only showing top 10 rows
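Writing tab-separated output works the same way in reverse: reuse the CSV writer with a tab separator. A minimal sketch (output.tsv is a hypothetical output path):
# Save a DataFrame as TSV by passing a tab as the CSV separator.
auto_df.write.option("sep", "\t").option("header", True).csv("output.tsv")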
Save a DataFrame in CSV format
See https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html for a list of supported options.
auto_df.write.csv("output.csv")
Load a DataFrame from Parquet
df = spark.read.format("parquet").load("data/auto-mpg.parquet")
# Code snippet result:
+----+---------+------------+----------+------+------------+---------+------+----------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|modelyear|origin| carname|
+----+---------+------------+----------+------+------------+---------+------+----------+
|18.0| 8| 307.0| 130.0| 3504.| 12.0| 70| 1|chevrol...|
|15.0| 8| 350.0| 165.0| 3693.| 11.5| 70| 1|buick s...|
|18.0| 8| 318.0| 150.0| 3436.| 11.0| 70| 1|plymout...|
|16.0| 8| 304.0| 150.0| 3433.| 12.0| 70| 1|amc reb...|
|17.0| 8| 302.0| 140.0| 3449.| 10.5| 70| 1|ford to...|
|15.0| 8| 429.0| 198.0| 4341.| 10.0| 70| 1|ford ga...|
|14.0| 8| 454.0| 220.0| 4354.| 9.0| 70| 1|chevrol...|
|14.0| 8| 440.0| 215.0| 4312.| 8.5| 70| 1|plymout...|
|14.0| 8| 455.0| 225.0| 4425.| 10.0| 70| 1|pontiac...|
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...|
+----+---------+------------+----------+------+------------+---------+------+----------+
only showing top 10 rows
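Unlike CSV, Parquet stores the schema alongside the data, so columns come back with their original types. A quick way to confirm what Spark recovered:
# Print the schema read from the Parquet file metadata.
df.printSchema()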
Save a DataFrame in Parquet format
auto_df.write.parquet("output.parquet")
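If you expect to filter by a column later, you can partition the output by that column so readers can skip irrelevant files. A sketch using the modelyear column from this dataset (output_partitioned.parquet is a hypothetical path):
# Write one subdirectory per distinct modelyear value.
auto_df.write.partitionBy("modelyear").mode("overwrite").parquet(
    "output_partitioned.parquet"
)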
Load a DataFrame from JSON Lines (jsonl) Formatted Data
JSON Lines (jsonl) format uses one JSON document per line. If your data has a mostly regular structure, this is better than nesting it in an array. See jsonlines.org.
df = spark.read.json("data/weblog.jsonl")
# Code snippet result:
+----------+----------+--------+----------+----------+------+
| client| country| session| timestamp| uri| user|
+----------+----------+--------+----------+----------+------+
|{false,...|Bangladesh|55fa8213| 869196249|http://...|dde312|
|{true, ...| Niue|2fcd4a83|1031238717|http://...|9d00b9|
|{true, ...| Rwanda|013b996e| 628683372|http://...|1339d4|
|{false,...| Austria|07e8a71a|1043628668|https:/...|966312|
|{false,...| Belize|b23d05d8| 192738669|http://...|2af1e1|
|{false,...|Lao Peo...|d83dfbae|1066490444|http://...|844395|
|{false,...|French ...|e77dfaa2|1350920869|https:/...| null|
|{false,...|Turks a...|56664269| 280986223|http://...| null|
|{false,...| Ethiopia|628d6059| 881914195|https:/...|8ab45a|
|{false,...|Saint K...|85f9120c|1065114708|https:/...| null|
+----------+----------+--------+----------+----------+------+
only showing top 10 rows
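The writer side is symmetric: each row becomes one JSON document per line. A minimal sketch mirroring the other save examples (output.jsonl is a hypothetical path; Spark writes it as a directory of part files):
# Save each row as a single-line JSON document.
auto_df.write.mode("overwrite").json("output.jsonl")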
Save a DataFrame into a Hive catalog table
Save a DataFrame to a Hive-compatible catalog. Use table to save in the session's current database, or database.table to save in a specific database.
auto_df.write.mode("overwrite").saveAsTable("autompg")
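To save into a specific database rather than the session's current one, qualify the table name. A sketch assuming a database named mydb (hypothetical) already exists:
# Create the target database if needed, then save into it.
spark.sql("CREATE DATABASE IF NOT EXISTS mydb")
auto_df.write.mode("overwrite").saveAsTable("mydb.autompg")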
Load a Hive catalog table into a DataFrame
Load a DataFrame from a particular table. Use table to load from the session's current database, or database.table to load from a specific database.
df = spark.table("autompg")
# Code snippet result:
+----+---------+------------+----------+------+------------+---------+------+----------+
| mpg|cylinders|displacement|horsepower|weight|acceleration|modelyear|origin| carname|
+----+---------+------------+----------+------+------------+---------+------+----------+
|18.0| 8| 307.0| 130.0| 3504.| 12.0| 70| 1|chevrol...|
|15.0| 8| 350.0| 165.0| 3693.| 11.5| 70| 1|buick s...|
|18.0| 8| 318.0| 150.0| 3436.| 11.0| 70| 1|plymout...|
|16.0| 8| 304.0| 150.0| 3433.| 12.0| 70| 1|amc reb...|
|17.0| 8| 302.0| 140.0| 3449.| 10.5| 70| 1|ford to...|
|15.0| 8| 429.0| 198.0| 4341.| 10.0| 70| 1|ford ga...|
|14.0| 8| 454.0| 220.0| 4354.| 9.0| 70| 1|chevrol...|
|14.0| 8| 440.0| 215.0| 4312.| 8.5| 70| 1|plymout...|
|14.0| 8| 455.0| 225.0| 4425.| 10.0| 70| 1|pontiac...|
|15.0| 8| 390.0| 190.0| 3850.| 8.5| 70| 1|amc amb...|
+----+---------+------------+----------+------+------------+---------+------+----------+
only showing top 10 rows
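If you prefer SQL, an equivalent approach is to query the catalog table directly:
# Same result via Spark SQL.
df = spark.sql("SELECT * FROM autompg")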